
Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files #1033


Merged (13 commits, Mar 6, 2025)

Conversation

santoshborse
Collaborator

Why are these changes needed?

Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files

Related issue number (if any).

#1009

@santoshborse santoshborse force-pushed the tokenization2arrow_transform branch from 4231f1d to b0d1490 Compare February 10, 2025 19:30
@touma-I touma-I requested a review from cmadam February 10, 2025 19:38
@shahrokhDaijavad
Collaborator

Thank you very much, @santoshborse. I tested the transform by running make run-cli-sample-python successfully.
A few things are still missing, such as a test directory that we use for CI/CD testing, and I will make the README more consistent with the other READMEs. For now, one small change in the README:

make run-cli-sample => make run-cli-sample-python

There is also an urgent request for a simple notebook that @Hajar-Emami can mimic in her GneissWeb recipe notebook. An example of such a minimal notebook is this one:
https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/tokenization.ipynb
i.e.,

  1. Show the CLI option table that you have in the README
  2. from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
  3. Tokenization2Arrow(input_folder= "test-data/ds02/input",
     output_folder= "output",
     .... other CLI arguments).transform()
  4. import glob
     glob.glob("output/*")

@cmadam (Collaborator) left a comment

One problem with this transform is that it has no tests:

$ make test-src
...
source venv/bin/activate;       \
export PYTHONPATH=../src:../: ;  \
cd test; pytest -s .
/bin/bash: line 3: cd: test: No such file or directory
==================== test session starts ====================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/cma/de/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collected 0 items

=================== no tests ran in 0.01s ===================
make: *** [../../../.make.defaults:442: .defaults.test-src] Error 5

Please add some tests to the transform.
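A minimal, self-contained sketch of the kind of unit test such a test folder could hold. The `fake_transform` stand-in below only mimics the input-folder/output-folder contract and is not the real Tokenization2Arrow transform; a real test would invoke the transform (or the DPK test harness) instead.

```python
import glob
import os
import tempfile


def fake_transform(input_folder: str, output_folder: str) -> None:
    # Stand-in for Tokenization2Arrow(...).transform(): writes one .arrow
    # file per input file (file contents are elided in this sketch).
    os.makedirs(output_folder, exist_ok=True)
    for name in os.listdir(input_folder):
        stem, _ = os.path.splitext(name)
        with open(os.path.join(output_folder, stem + ".arrow"), "wb") as f:
            f.write(b"")


def test_outputs_arrow_files():
    # Arrange a temporary input folder with one sample file, run the
    # transform, and assert that .arrow output was produced.
    with tempfile.TemporaryDirectory() as tmp:
        inp = os.path.join(tmp, "in")
        out = os.path.join(tmp, "out")
        os.makedirs(inp)
        open(os.path.join(inp, "sample.parquet"), "wb").close()
        fake_transform(inp, out)
        assert glob.glob(os.path.join(out, "*.arrow"))
```

Dropping a file like this into a `test/` directory lets `make test-src` (which runs `pytest` there) collect and run it.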


Collaborator

For consistency, it may be better to use ray.transform. This will make the module easier to maintain.

Collaborator Author

I think because we have transform_python, which we use as
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

it makes more sense to have transform_ray, but if you insist I will change it.

@touma-I (Collaborator) commented Mar 3, 2025

@santoshborse The latest convention is to use dpk_<transform>.runtime (instead of dpk_<transform>.transform_python) and dpk_<transform>.ray.runtime. It would be nice if you can stay with the convention. It makes support easier, and there does not seem to be a good reason not to.

# TODO: check if we should add anything to tokenization_metadata
return [(bos.getvalue().to_pybytes(), ".arrow")], tokenization_metadata

def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
Collaborator

Not clear why this is implementing transform_binary() and not transform(). I think the code structure will be easier to understand/maintain if you implement transform() and then call super().transform() before calling transforms_to_arrow().

Collaborator Author

transform_binary returns tuple[list[tuple[bytes, str]], dict[str, Any]]
transform returns tuple[list[pa.Table], dict[str, Any]]

I am using transform_binary so that I can return the data as bytes (so that the framework can write the .arrow files).

@touma-I (Collaborator) left a comment

Please provide one or more unit tests in the test folder.

Collaborator

Let's discuss how we can redo this one. Maybe two requirements.txt files: one used as part of the packaging, and one used for pulling the dependency on tokenization. Also, have you considered making this module part of the tokenization module? Would it be easier, for inheritance, for this module to be an extension of tokenization rather than its own module?

@touma-I (Collaborator) left a comment

Added the module to pyproject.toml so it is included when building the wheel.

@shahrokhDaijavad
Copy link
Collaborator

Thanks for adding the notebook, @santoshborse!

Collaborator

@santoshborse This file is missing the following:
TRANSFORM_PYTHON_SRC=
TRANSFORM_RAY_SRC=

See https://github.com/IBM/data-prep-kit/blob/dev/transforms/Makefile.transform.template for an example. This is also related to the comment above on transform_python and transform_ray. Let's follow the convention, since there is no really good reason not to.

Collaborator Author

Hi @touma-I, I have updated the module names and the Makefile as you asked.

Collaborator

@santoshborse Thank you! This looks good. Sorry about the confusion. I am hoping in the next iteration we will simplify things further and get rid of a few constraints. Please stay tuned. I might also reach out to bounce off a few ideas.

@touma-I (Collaborator) left a comment

@santoshborse The two additional comments I added should address the failed CI/CD. Please reach out if anything is not clear.

santoshborse and others added 10 commits March 5, 2025 14:25
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>

@touma-I (Collaborator) commented Mar 5, 2025

@shahrokhDaijavad Can you review the readme.md and notebooks before I merge? Thanks

@shahrokhDaijavad (Collaborator) left a comment

@santoshborse In the two notebooks, can you please change
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
to
from dpk_tokenization2arrow.runtime import Tokenization2Arrow

and
from dpk_tokenization2arrow.transform_ray import Tokenization2Arrow
to
from dpk_tokenization2arrow.ray.runtime import Tokenization2Arrow

Also, in the Makefile, don't we need to make the same change for the two targets:
dpk_$(TRANSFORM_NAME).transform_python => dpk_$(TRANSFORM_NAME).runtime
and
dpk_$(TRANSFORM_NAME).transform_ray => dpk_$(TRANSFORM_NAME).ray.runtime

After these changes, if you don't mind, I will make a few small changes in the README file myself.

@santoshborse
Collaborator Author

@shahrokhDaijavad Updated the notebook and Makefile; I also tested locally to make sure the notebook and both Makefile commands work.

Signed-off-by: Santosh Borse <[email protected]>
@touma-I touma-I self-requested a review March 6, 2025 18:31
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Collaborator) left a comment

Thanks, @santoshborse. I made a few changes to the README file.

@touma-I touma-I merged commit 261c925 into dev Mar 6, 2025
7 checks passed
@santoshborse santoshborse deleted the tokenization2arrow_transform branch May 28, 2025 19:31